Challenges in Persian Electronic Text Analysis
Abstract
Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. Farsi enjoys a unified Arabic script as its writing system. In this paper we briefly introduce the writing standards of Farsi and highlight the problems one faces when analyzing Farsi electronic texts, especially those concerning transcription and encoding during the development of Farsi corpora. The points mentioned may sound simple, but they are crucial when developing and processing written corpora of Farsi.
Similar resources
Sentiment analysis methods in Persian text: A survey
With the explosive growth of social media such as Twitter, reviews on e-commerce websites, and comments on news websites, individuals and organizations are increasingly using the opinions in these media for their decision making. Sentiment analysis is one of the techniques used in recent years to analyze users' opinions. The Persian language has specific features and thereby requires unique meth...
STeP-1: A Set of Fundamental Tools for Persian Text Processing
Many NLP applications need fundamental tools to convert the input text into an appropriate form or format and to extract the primary linguistic knowledge of words and sentences. These tools segment text into sentences, words and phrases, check and correct spelling, perform lexical and morphological analysis, POS tagging and so on. Persian is among languages with complex prepr...
Persian/Arabic Document Segmentation Based On Pyramidal Image Structure
Automatic transformation of paper documents into electronic documents requires document segmentation at the first stage. However, restrictions on some parameters, such as variations in character font sizes, different text line spacing, and non-uniform document layout structures, have together made it difficult for many years to design a general-purpose document layout analysis algorithm. Thus...
Semiotic Analysis of Written Signs in the Road Sign Systems of Tehran City
Introduction: as a component of the urban landscape, road sign systems are among the most critical elements of urban environments. Generally speaking, the written signs dominate the design of these systems. These signs can also foster aesthetic and visual pleasure compellingly and innovatively. Furthermore, they perpetuate a specific image in the minds of their observers. This research seeks to...
A Study of Corpus Development for Persian
Persian is an Indo-European language that has borrowed its script from Arabic, a member of the Semitic language family. Since the Persian and Arabic scripts are so similar, problems arise when we want to process an electronic text. In this paper, some of the common problems faced experimentally in developing a corpus for Persian are discussed. The sources of the problems are the Persian scrip...
Journal: CoRR
Volume: abs/1404.4740
Publication date: 2014